Abstract

The M-Machine is an experimental multicomputer being developed to test architectural
concepts motivated by the constraints of modern semiconductor technology and the demands
of programming systems. The M-Machine computing nodes are connected with a 3-D mesh
network; each node is a multithreaded processor incorporating 12 function units, on-chip
cache, and local memory. The multiple function units are used to exploit both
instruction-level and thread-level parallelism. A user-accessible message passing system
yields fast communication and synchronization between nodes. Rapid access to remote memory
is provided transparently to the user with a combination of hardware and software
mechanisms. This paper presents the architecture of the M-Machine and describes how its
mechanisms maximize both single-thread performance and overall system throughput.
1 Introduction

Because of the increasing density of VLSI integrated circuits, most of the chip area
of modern computers is now occupied by memory and not by processing resources. The
M-Machine is an experimental multicomputer being developed to test architectural
concepts that are motivated by these constraints of modern semiconductor technology
and by the demands of programming systems, such as faster execution of fixed-size
problems and easier programmability of parallel computers.

Advances in VLSI technology have resulted in computers with chip area dominated by
memory and not by processing resources. The normalized area (in λ²) of a VLSI chip¹
is increasing by 50% per year, while gate speed and communication bandwidth are
increasing by 20% per year [10]. As a result, a 64-bit processor with a pipelined FPU
(400 Mλ²) is only 11% of a 3.6 Gλ² 1993 0.5 µm chip and only 4% of a 10 Gλ² 1996
0.35 µm chip. In a system with 64 MBytes (256 MBytes in 1996) of DRAM, the processor
accounts for 0.52% (0.13% in 1996) of the silicon area in the system. The memory
system, cache, TLB, controllers, and DRAM account for most of the remaining area.
Technology scaling has made the memory, rather than the processor, the most
area-consuming resource in a computer system.

To address this imbalance, the M-Machine increases the fraction of chip area devoted
to the processor, to make better use of the critical memory resources. An M-Machine
multi-ALU processor (MAP) chip contains four 64-bit three-issue clusters that comprise
32% of the 5 Gλ² chip and 11% of an 8 MByte (six-chip) node. The multiple execution
clusters provide better performance than using a single cluster and a large on-chip
cache in the same chip area. The high ratio of arithmetic bandwidth to memory
bandwidth (12 operations/word) allows the MAP to saturate the costly DRAM bandwidth
even on code with high cache-hit ratios. A 32-node M-Machine system with 256 MBytes of
memory has 128 times the peak performance of a 1996 uniprocessor with the same memory
capacity at 1.5 times the area, an 85:1 improvement in peak performance/area. Even at
a small fraction of this peak performance, such a machine allows the costly,
fixed-size memory to handle more problems per unit time, resulting in more
cost-effective computing.
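
The area and performance ratios quoted in the two preceding paragraphs can be checked with a few lines of arithmetic. The sketch below is only a sanity check on the figures given in the text; the constant and function names are illustrative and do not come from the M-Machine project.

```python
# Areas in units of lambda^2, as quoted in the text.
PROCESSOR_AREA = 400e6    # 64-bit processor with a pipelined FPU
CHIP_AREA_1993 = 3.6e9    # 1993 chip, 0.5 um process
CHIP_AREA_1996 = 10e9     # 1996 chip, 0.35 um process


def percent(part: float, whole: float) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


def perf_per_area_gain(peak_ratio: float, area_ratio: float) -> float:
    """Improvement in peak performance per unit area."""
    return peak_ratio / area_ratio


share_1993 = percent(PROCESSOR_AREA, CHIP_AREA_1993)   # roughly 11%
share_1996 = percent(PROCESSOR_AREA, CHIP_AREA_1996)   # 4%
gain = perf_per_area_gain(128, 1.5)                    # roughly 85
```

The 85:1 figure is simply 128 divided by 1.5, rounded down.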

The M-Machine is designed to extract more parallelism from problems of a fixed size,
rather than requiring enormous problems to achieve peak performance. To do this, nodes
are designed to manage parallelism from the instruction level to the process level.
The 12 function units in a single M-Machine node are controlled using a form of
Processor Coupling [13] to exploit instruction-level parallelism by executing 12
operations from the same thread, or to exploit thread-level parallelism by executing
operations from up to six different threads. The fast internode communication allows
collaborating threads to reside on different nodes.

¹ The parameter λ is a normalized, process-independent unit of distance equivalent to
one half of the gate length [18]. For a 0.5 µm process, λ is 0.25 µm.

The M-Machine also addresses the demand for easier programmability by providing an
incremental path for increasing parallelism and performance. An unmodified sequential
program can run on a single M-Machine node, accessing both local and remote memory.
This code can be incrementally parallelized by identifying tasks, such as loop
iterations, that can be distributed both across nodes and within each node to run in
parallel. A flat, shared address space simplifies naming and communication. The local
caching of remote data in local DRAM automatically migrates a task's data to exploit
locality.

The remainder of this paper describes the M-Machine in more detail. Section 2 gives an
overview of the machine architecture. Mechanisms for intra-node parallelism are
described in Section 3. Section 4 discusses inter-node communication, including the
user-level communication primitives and how they are used to provide global coherent
memory access.

2  M-Machine  Architecture 

The M-Machine consists of a collection of computing nodes interconnected by a
bidirectional 3-D mesh network, as shown in Figure 1. Each six-chip node consists of a
multi-ALU (MAP) chip and 1 MW (8 MBytes) of synchronous DRAM (SDRAM). The MAP chip
includes the network interface and router, and it provides an equal bandwidth of
800 MBytes/s to the local SDRAM and to each network channel. I/O devices may be
connected either to an I/O bus available on each node, or to I/O nodes (IONs) attached
to the face channels.

As shown in Figure 2, a MAP contains four execution clusters, a memory subsystem
comprised of four cache banks and an external memory interface, and a communication
subsystem consisting of the network interfaces and the router. Two crossbar switches
interconnect these components. Clusters make memory requests to the appropriate bank
of the interleaved cache over the 150-bit wide (address+data) 4x4 M-Switch. The 90-bit
wide 10x4 C-Switch is used for inter-cluster communication and to return data from the
memory system. Both switches support up to four transfers per cycle.

MAP Execution Clusters: Each of the four MAP clusters is a 64-bit, three-issue,
pipelined processor consisting of two integer ALUs, a floating-point ALU, associated
register files, and a 1KW (8KB) instruction cache, as shown in Figure 3. One of the
integer ALUs in each cluster, termed the memory unit, serves as the interface to the
memory system. Each MAP instruction contains 1, 2, or 3 operations, one for each ALU.
All operations in a single instruction issue together but may complete out of order.

Figure 1: The M-Machine architecture.

Figure 2: The MAP architecture.

Figure 3: A MAP cluster consists of 3 execution units, 2 register files, an
instruction cache and ports onto the memory and cluster switches.

Memory System: As illustrated in Figure 2, the on-chip cache is organized as four
word-interleaved 4KW (32KB) banks to permit four consecutive word accesses to proceed
in parallel. The cache is virtually addressed and tagged. The cache banks are
pipelined with a three-cycle read latency, including switch traversal.

The external memory interface consists of the SDRAM controller and a local translation
lookaside buffer (LTLB) used to cache local page table (LPT) entries. Pages are 512
words (64 8-word cache blocks). The SDRAM controller exploits the pipeline and page
mode of the external memory and performs SECDED² error control.

A synchronization bit is associated with each word of memory. Special load and store
operations may specify a precondition and a postcondition on the synchronization bit.
These are the only atomic read-modify-write memory operations.
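
As an illustration of how a precondition and postcondition on the synchronization bit can express producer-consumer handoff, the following is a minimal sketch. The encoding of the conditions (`None` for "don't care" or "leave unchanged"), the `SyncFault` exception, and all function names are assumptions made for this example; the text does not specify the actual instruction encodings.

```python
from dataclasses import dataclass
from typing import Optional


class SyncFault(Exception):
    """Raised when the precondition on the synchronization bit is not met."""


@dataclass
class Word:
    value: int = 0
    sync: bool = False   # synchronization bit: False = empty, True = full


def _check_and_set(word: Word, pre: Optional[bool], post: Optional[bool]) -> None:
    # pre=None means "don't care"; post=None means "leave the bit unchanged".
    if pre is not None and word.sync != pre:
        raise SyncFault("precondition on synchronization bit not met")
    if post is not None:
        word.sync = post


def sync_load(word: Word, pre: Optional[bool], post: Optional[bool]) -> int:
    """Load that tests `pre`, then applies `post` to the synchronization bit."""
    _check_and_set(word, pre, post)
    return word.value


def sync_store(word: Word, value: int, pre: Optional[bool], post: Optional[bool]) -> None:
    """Store that tests `pre`, then applies `post` to the synchronization bit."""
    _check_and_set(word, pre, post)
    word.value = value


# Producer: store only if empty, leave full. Consumer: load only if full, leave empty.
cell = Word()
sync_store(cell, 42, pre=False, post=True)
consumed = sync_load(cell, pre=True, post=False)
```

In this sketch a failed precondition raises an exception; on the M-Machine the corresponding condition is reported as a memory synchronizing fault, which Section 3.3 describes as being handled asynchronously.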

The M-Machine supports a single global virtual address space. A light-weight
capability system implements protection through guarded pointers [3], while paging is
used to manage the relocation of data in physical memory within the virtual address
space. The segmentation and paging mechanisms are independent so that protection may
be preserved on variable-size segments of memory. The memory subsystem is integrated
with the communication system and can be used to access memory on remote nodes, as
described in Section 4.2.

² Single error correcting, double error detecting.

Communication Subsystem: Messages are composed in the general registers of a cluster
and launched atomically using a user-level SEND instruction. Protection is provided by
sending a message to a virtual memory address that is automatically translated to the
destination node identifier by a global translation lookaside buffer (GTLB), which
caches entries of a global destination table (GDT). Arriving messages are queued in a
register-mapped hardware FIFO readable by a system-level message handler. Two network
priorities are provided, one for requests and one for replies.

3 Intra-node Concurrency Mechanisms

The amount and granularity of parallelism varies enormously across application
programs and even during different phases of the same program. Some phases have an
abundance of instruction-level parallelism that can be extracted at compile time.
Others have data-dependent parallelism that can be executed using multiple threads
with widely varying task sizes.

The M-Machine is designed to efficiently execute programs with any or all
granularities of parallelism. On the MAP, parallel instruction sequences (H-Threads)
are run concurrently on the four clusters to exploit ILP across all 12 of the function
units. Alternatively they may be used to exploit loop-level parallelism. To exploit
thread-level parallelism and to mask variable pipeline, memory, and communication
delays, the MAP interleaves the 12-wide instruction streams from different tasks,
V-Threads, within each cluster on a cluster-by-cluster and cycle-by-cycle basis, thus
sharing the execution resources among all active tasks.

This arrangement of V-Threads (Vertical Threads) and H-Threads (Horizontal Threads) is
summarized in Figure 4. Six V-Threads are resident in the cluster register files. Each
V-Thread consists of four H-Threads, one on each cluster. Each H-Thread consists of a
sequence of 3-wide instructions containing integer, memory, and floating point
operations. On each cluster the H-Threads from the different V-Threads are interleaved
over the execution units.

3.1  H-Threads 

An H-Thread runs on a single cluster and executes a sequence of operation triplets
(one operation for each of the 3 ALUs in the cluster) that are issued simultaneously.
Within an H-Thread, instructions are guaranteed to issue in order, but may complete
out of order. An H-Thread may communicate and synchronize via registers with the 3
other H-Threads in the same V-Thread, each executing on a separate cluster. Each
H-Thread reads operands from its own register file, but can directly write to the
register file of any H-Thread in its own V-Thread.

H-Threads support multiple execution models. They can execute as independent threads
with possibly different control flows to exploit loop-level or thread-level
parallelism. Alternatively, the compiler can schedule the four H-Threads in a V-Thread
as a unit to exploit instruction-level parallelism, as in a VLIW machine. In this case
the compiler must insert explicit register-based synchronization to enforce
instruction ordering between H-Threads. Unlike the lock-step execution of traditional
VLIW machines, H-Thread synchronization occurs infrequently, being required only by
data or resource dependencies. While explicit synchronization incurs some overhead, it
allows H-Threads to slip relative to each other in order to accommodate
variable-latency operations such as memory accesses.

Figure 5 shows an illustrative example of the instruction sequences of a program
fragment on 1 and 2 H-Threads. The program is the body of the inner loop of a
"smoothing" operation using a 7-point stencil on a 3-D grid. On a particular grid
point, the smoothed value is given by
$u^* = u^* + a \times r^* + b \times (r_u + r_d + r_n + r_s + r_e + r_w)$, where $r^*$
is the residual value at that point, and $r_u$, $r_d$, $r_n$, $r_s$, $r_e$, and $r_w$
are the residuals at the neighboring grid points in the six directions up, down,
north, south, east, and west, respectively. In order to better illustrate the use of
H-Threads, advanced optimization (such as software pipelining) is not performed.

Figure 5(a) shows the single H-Thread program, whose instruction stream is 12
instructions long and includes all of the memory and floating point operations. The
weighting constants a and b are kept in registers. Figure 5(b) shows the instruction
streams for two H-Threads working cooperatively. Each H-Thread performs four memory
operations and some of the arithmetic calculations. Instruction 7 in H-Thread 0
calculates a partial sum and transmits it directly to register t2 in H-Thread 1. The
empty instruction on H-Thread 1 is used to prepare t2 for H-Thread synchronization;
H-Thread 1 will not issue instruction 7 until the data arrives from H-Thread 0, as
explained below.

The use of multiple H-Threads reduces the static depth of the instruction sequences
from 12 to 8. On a larger 27-point stencil, the depth is reduced from 36 to 17 when
run on 4 H-Threads. The actual execution time of the program fragments will depend on
the pipeline and memory latencies.
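
To make the division of work in Figure 5 concrete, the sketch below mirrors the floating-point operations of the single H-Thread version and of the two cooperating H-Threads as ordinary functions, and checks that both produce the same smoothed value. The function names and the use of a return value to stand in for the write to register t2 of H-Thread 1 are assumptions of this example; timing, loads, and stores are not modeled.

```python
def smooth_single(u, r, ru, rd, rn, rs, re, rw, a, b):
    """Figure 5(a): one H-Thread performs the whole update."""
    t2 = ru + rd
    t2 = t2 + rn
    t2 = t2 + rs
    t2 = t2 + re
    t2 = t2 + rw
    t2 = b * t2
    t1 = a * r
    t1 = t1 + t2
    return u + t1


def hthread0(u, r, ru, rd, a, b):
    """Figure 5(b), H-Thread 0: returns the value written to H1.t2."""
    t2 = ru + rd
    t2 = b * t2
    t1 = a * r
    t1 = u + t1
    return t1 + t2


def hthread1(t2_from_h0, rn, rs, re, rw, b):
    """Figure 5(b), H-Thread 1: finishes the sum once t2 has arrived."""
    t1 = rn + rs
    t1 = t1 + re
    t1 = t1 + rw
    t1 = b * t1
    return t1 + t2_from_h0


# Integer inputs keep the comparison exact; with floating-point data the two
# schedules add terms in a different order and may differ in the last bits.
args = dict(u=1, r=2, ru=3, rd=4, rn=5, rs=6, re=7, rw=8, a=2, b=3)
single = smooth_single(**args)
partial = hthread0(args["u"], args["r"], args["ru"], args["rd"], args["a"], args["b"])
paired = hthread1(partial, args["rn"], args["rs"], args["re"], args["rw"], args["b"])
```

The two-thread version distributes the multiplication by b over the two partial sums, which is why each H-Thread performs its own multiply.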

H-Thread  Synchronization 

As shown in the example of Figure 5, H-Threads synchronize through registers. A
scoreboard bit associated with the destination register is cleared (empty) when a
multicycle operation, such as a load, issues and set (full) when the result is
available. An operation that uses the result will not be selected for issue until the
corresponding scoreboard bit is set.

Inter-cluster data transfers require explicit register synchronization. To prepare for
inter-cluster data transfers, the receiving H-Thread executes an EMPTY operation to
mark empty a set of destination registers. As each datum arrives from the transmitting
H-Thread over the C-Switch, the corresponding destination register is set full. An
instruction in the receiving H-Thread that uses the arriving data will not be eligible
for issue until its data is available.

Four pairs of single-bit global condition code (CC) registers are used to broadcast
binary values across the clusters. Unlike centrally located global registers, the MAP
global CC registers are physically replicated on each of the clusters. A cluster may
broadcast using either register in only one of the four pairs, but may read and empty
its local copy of any global CC register. Using these registers, all four H-Threads
can execute conditional branches and assignment operations based on a comparison
performed in a single cluster.

Figure 4: Multiple V-Threads are interleaved dynamically over the cluster resources.
Each V-Thread consists of 4 H-Threads which execute on different clusters.

The scoreboard bits associated with the global CC registers may be used to rapidly
synchronize the H-Threads within a V-Thread. Figure 6 shows an example of two
H-Threads synchronizing at loop boundaries. Two registers are involved in the
synchronization, in order to provide an interlocking mechanism ensuring that neither
H-Thread rolls over into the next loop iteration.

H-Thread 0 computes bar, compares it (using eq) to end, and broadcasts the result by
targeting gcc1. H-Thread 1 uses gcc1 to determine whether to branch, marks gcc1 empty
again, and writes to gcc3 to notify H-Thread 0 that the current value of gcc1 has been
consumed. H-Thread 0 blocks until gcc3 is full, and then empties it for the next
iteration. Neither thread can proceed with the next iteration until both have
completed the current one. Due to the multicopy structure of MAP global CC registers,
this protocol can easily be extended to perform a fast barrier among 4 H-Threads
executing on different clusters, without combining or distribution trees.
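
The interlock described above can be imitated in software with two full/empty cells. The sketch below uses Python threads and condition variables purely as a stand-in for the hardware scoreboard bits; the `FullEmptyReg` class, the `run_loop` function, and the use of a single shared copy per register (rather than one replica per cluster) are simplifying assumptions of this example. It also assumes the loop bound `end` is at least 1.

```python
import threading


class FullEmptyReg:
    """One-bit register guarded by a full/empty (scoreboard) bit."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._full = False
        self._value = False

    def write(self, value: bool) -> None:
        """Store a value and mark the register full."""
        with self._cond:
            self._value = value
            self._full = True
            self._cond.notify_all()

    def read_and_empty(self) -> bool:
        """Block until the register is full, then read it and mark it empty."""
        with self._cond:
            while not self._full:
                self._cond.wait()
            self._full = False
            return self._value


def run_loop(end: int) -> list:
    """Run both loops for `end` iterations and return the order of completion."""
    gcc1, gcc3 = FullEmptyReg(), FullEmptyReg()
    trace = []

    def hthread0() -> None:
        bar = 0
        while True:
            bar += 1                    # "compute bar"
            again = bar != end          # comparison against the loop bound
            gcc1.write(again)           # broadcast the branch condition
            gcc3.read_and_empty()       # wait until H-Thread 1 has consumed it
            trace.append(("H0", bar))
            if not again:
                break

    def hthread1() -> None:
        iteration = 0
        while True:
            iteration += 1
            again = gcc1.read_and_empty()   # branch condition from H-Thread 0
            trace.append(("H1", iteration))
            gcc3.write(True)                # acknowledge consumption
            if not again:
                break

    threads = [threading.Thread(target=f) for f in (hthread0, hthread1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return trace


order = run_loop(3)
```

Because each thread blocks on the register the other one fills, the recorded completions strictly alternate, so neither thread can run ahead into the next iteration.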


3.2 V-Threads

A V-Thread (vertical thread) consists of 4 H-Threads, each running concurrently on a
different cluster. As discussed above, H-Threads within the same V-Thread may
communicate via registers. However, H-Threads in different V-Threads may only
communicate and synchronize through messages or memory. The MAP has enough resources
to hold the state of six V-Threads, each one occupying a thread slot. Four of these
slots are user slots, one is the event slot, and one is the exception slot. User
threads run in the user slots, handlers for asynchronous events and messages run in
the event slot, and handlers for synchronous exceptions detected within a cluster,
such as protection violations, run in the exception slot.

(a) Single H-Thread

         MEM Unit     FP Unit
     1.  load r_u
     2.  load r_d
     3.  load r_n     t2 = r_u + r_d
     4.  load r_s     t2 = t2 + r_n
     5.  load r_e     t2 = t2 + r_s
     6.  load r_w     t2 = t2 + r_e
     7.  load r*      t2 = t2 + r_w
     8.  load u*      t2 = b x t2
     9.               t1 = a x r*
    10.               t1 = t1 + t2
    11.               u* = u* + t1
    12.  store u*

(b) Two concurrent H-Threads

    H-Thread 0
         MEM Unit     FP Unit
     1.  load r_u
     2.  load r_d
     3.  load r*      t2 = r_u + r_d
     4.  load u*      t2 = b x t2
     5.               t1 = a x r*
     6.               t1 = u* + t1
     7.               H1.t2 = t1 + t2

    H-Thread 1
         MEM Unit     FP Unit
     1.  load r_n
     2.  load r_s     empty t2
     3.  load r_e     t1 = r_n + r_s
     4.  load r_w     t1 = t1 + r_e
     5.               t1 = t1 + r_w
     6.               t1 = b x t1
     7.               u* = t1 + t2
     8.  store u*

Figure 5: Example of H-Threads used to exploit instruction level parallelism: (a)
single H-Thread, (b) two H-Threads. The computation is a smoothing operator using a
7-point stencil on a 3-D grid:
$u^* = u^* + a \times r^* + b \times (r_u + r_d + r_n + r_s + r_e + r_w)$.

    H-Thread 0                             H-Thread 1
    LOOP_0: compute bar                    LOOP_1: compute
            eq bar end gcc1
            br gcc1 LOOP_0                         br gcc1 LOOP_1
            use gcc3    | branch delay             empty gcc1  | branch delay
            empty gcc3  | slots                    write gcc3  | slots

Figure 6: Loop synchronization between two H-Threads using MAP global CC registers.

On each cluster, six H-Threads (one from each V-Thread) are interleaved dynamically
over the cluster resources on a cycle-by-cycle basis. A synchronization pipeline stage
holds the next instruction to be issued from each of the six V-Threads until all of
its operands are present and all of the required resources are available [13]. At
every cycle this stage decides which instruction to issue from those which are ready
to run. An H-Thread that is stalled waiting for data or resource availability consumes
no resources other than the thread slot that holds its state. As long as its data and
resource dependencies are satisfied, a single thread may issue an instruction every
cycle. Multiple V-Threads may be interleaved with zero delay, which allows task
switching to be used to mask even very short pipeline latencies as well as longer
communication and synchronization latencies.
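
The effect of zero-delay interleaving on issue-slot utilization can be illustrated with a toy model of one cluster. In the sketch below each thread must wait `latency` cycles between its own successive instructions, and every cycle the issue stage picks one ready thread. The round-robin selection order is an assumption of this example; the text states only that the stage chooses among the instructions that are ready.

```python
def simulate(num_threads: int, latency: int, cycles: int) -> float:
    """Return the fraction of cycles in which one cluster issues an instruction.

    latency is the number of cycles between successive issues of the same
    thread (1 means the thread can issue back to back).
    """
    ready_at = [0] * num_threads   # first cycle at which each thread may issue
    issued = 0
    next_slot = 0                  # round-robin starting point (assumed policy)
    for cycle in range(cycles):
        for k in range(num_threads):
            slot = (next_slot + k) % num_threads
            if ready_at[slot] <= cycle:
                issued += 1
                ready_at[slot] = cycle + latency
                next_slot = (slot + 1) % num_threads
                break              # at most one instruction per cycle
    return issued / cycles


one_stalling_thread = simulate(num_threads=1, latency=4, cycles=400)
four_stalling_threads = simulate(num_threads=4, latency=4, cycles=400)
one_ready_thread = simulate(num_threads=1, latency=1, cycles=400)
```

In this model a single thread with a four-cycle dependence uses a quarter of the issue slots, four such threads fill all of them, and a single thread with no stalls still issues every cycle.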

3.3  Asynchronous  Exception  Handling 

Exceptions that occur outside the MAP cluster are handled asynchronously by generating
an event record and placing it in a hardware event queue. LTLB misses, block status
faults, and memory synchronizing faults, for example, are handled asynchronously.
These exceptions are precise in the sense that the faulting operation and its operands
are specifically identified in the event record, but they are handled asynchronously,
without stopping the thread.

A dedicated handler in an H-Thread of the event V-Thread processes event records to
complete the faulting operations. The event handler loops, reading event records from
the register-mapped queue and processing them in turn. A read from the queue will not
issue if the queue is empty. For example, on a local TLB miss, the hardware formats
and enqueues an event record containing the faulting address as well as the write data
or read destination. A TLB miss handler reads the record, places the requested page
table entry in the TLB, and restarts the memory reference. The thread that issued the
reference does not block until it needs the data from the reference that caused the
miss. Inter-node message arrival is treated as an event in which the contents of the
message are written into the appropriate event queue (which serves as the message
queue).

Each H-Thread in the event V-Thread handles one class of events. Memory
synchronization and status faults are handled on cluster 0, local TLB misses are
handled on cluster 1, and arriving messages are handled on clusters 2 and 3, depending
on the priority of the message.
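
The division of event classes among the clusters, and the LTLB-miss handler loop described above, can be sketched as follows. The record layout, the event names, and the choice of cluster 2 for priority 0 messages and cluster 3 for priority 1 messages are assumptions of this example (the text says only that the two message priorities use clusters 2 and 3). The page size of 512 words is the one given in Section 2.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, Optional

PAGE_WORDS = 512   # local page size from Section 2

# Event class -> cluster whose event H-Thread handles it.
EVENT_CLUSTER = {
    "sync_fault": 0,
    "status_fault": 0,
    "ltlb_miss": 1,
    "message_p0": 2,   # assumed assignment of priorities to clusters 2 and 3
    "message_p1": 3,
}


@dataclass
class EventRecord:
    kind: str
    address: int
    data: Optional[int] = None   # write data, or None for a read


class EventNode:
    """Toy model of one node's event queues and its LTLB-miss handler."""

    def __init__(self, page_table: Dict[int, int]) -> None:
        self.lpt = dict(page_table)   # local page table: virtual -> physical page
        self.ltlb: Dict[int, int] = {}
        self.queues = {cluster: deque() for cluster in range(4)}
        self.restarted = []           # addresses whose references were restarted

    def raise_event(self, record: EventRecord) -> None:
        """Hardware side: enqueue the record for the responsible cluster."""
        self.queues[EVENT_CLUSTER[record.kind]].append(record)

    def handle_ltlb_misses(self) -> None:
        """Handler loop on cluster 1; real hardware stalls on an empty queue."""
        queue = self.queues[1]
        while queue:
            record = queue.popleft()
            vpage = record.address // PAGE_WORDS
            self.ltlb[vpage] = self.lpt[vpage]      # install the page table entry
            self.restarted.append(record.address)   # restart the memory reference


node = EventNode(page_table={3: 7})
node.raise_event(EventRecord("ltlb_miss", address=3 * PAGE_WORDS + 5))
node.raise_event(EventRecord("message_p0", address=0))
node.handle_ltlb_misses()
```

The message event remains queued for cluster 2, showing that the handlers for different event classes drain their queues independently.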

Handling exceptions asynchronously obviates the need to cancel all of the issued
operations following the faulting operation, a significant penalty in a 12-wide
machine with deep pipelines. Dedicating H-Threads to this purpose accelerates event
handling by eliminating the need to save and restore state, and allows concurrent
(interleaved) execution of user threads and event handlers. Asynchronous event
handling does require sufficient queue space to handle the case where every
outstanding instruction generates an exception. To reduce queue size requirements,
exceptions that are detected in the first execution cycle, such as protection
violations and some arithmetic exceptions, stall all user H-Threads in the affected
cluster, and are handled synchronously by the local H-Thread of the exception
V-Thread.


3.4  Discussion 

There are two major methods of exploiting instruction-level parallelism. Superscalar
processors execute multiple instructions simultaneously by relying upon runtime
scheduling mechanisms to determine data dependencies [23, 12]. However, they do not
scale well with an increasing number of function units because a greater number of
register file ports and connections to the function units are required. In addition,
superscalars attempt to schedule instructions at runtime (much of which could be done
at compile time), but they can only examine a small subsequence of the instruction
stream.

Very Long Instruction Word (VLIW) processors such as the Multiflow Trace series [4]
use only compile time scheduling to manage instruction-level parallelism, resource
usage, and communication among a partitioned register file. However, the strict
lock-step execution is unable to tolerate the dynamic latencies found in
multiprocessors.

Processor Coupling was originally introduced in [13] and used implicit synchronization
between the clusters on every wide instruction. Relaxing the synchronization, as
described in this section, has several advantages. First, it is easier to implement
because control is localized completely within the clusters. Second, it allows more
slip to occur between the instruction streams running on different clusters
(H-Threads), which eliminates the automatic blocking of one thread on long latency
operations of another, providing more opportunity for latency tolerance. Finally, the
H-Threads can be used flexibly to exploit both instruction and loop level parallelism.
When H-Threads must synchronize, they do so explicitly through registers, at a higher
cost than implicit synchronization. However, fewer synchronization operations are
required, and many of them can be included in data transfer between clusters.

Using multiple threads to hide memory latencies and pipeline delays has been explored
in several different studies and machines. Gupta and Weber explore the use of multiple
hardware contexts in multiprocessors [8], but the context switch overhead prevents the
masking of pipeline latencies. MASA [9] as well as HEP [22] use fine grain
multithreading to issue an instruction from a different context on every cycle in
order to mask pipeline latencies. However, with the required round-robin scheduling,
single thread performance is degraded by the number of pipeline stages. The zero cost
switching among V-Threads and the pipeline design of the MAP provide fast single
thread execution as well as latency tolerance for better local memory bandwidth
utilization.

4  Inter-node  Concurrency  Mechanisms 

The M-Machine provides a fast, protected, user-level message passing substrate. A user
program may communicate and synchronize by directly sending messages or by reading and
writing remote memory using a coherent shared memory system layered on the
message-passing substrate. Direct messaging provides maximum performance data transfer
and synchronization, while shared memory access simplifies programming. Remote memory
access is implemented using fast trap handlers that intercept load and store
operations that reference remote data. These handlers send messages to other nodes to
complete remote memory references transparently to user programs. Additional hardware
and software mechanisms allow remote data to be cached locally in both the cache and
external memory.

4.1  Message  Passing  Support 

The M-Machine provides hardware support for injecting a message into the network,
determining the message destination, and dispatching a handler on message arrival. For
example, Figure 7 shows the M-Machine instruction sequences for both the sending and
receiving components of a remote memory store. The message sending sequence
(Figure 7(a)) loads the data to be stored into general register MC1. The SEND
instruction takes three arguments: the target address (Raddr), the dispatch
instruction pointer (Rdip), and the message body length (#1). When the SEND issues,
the Global Translation Lookaside Buffer (GTLB) translates virtual address Raddr into a
physical node identifier and sends that node a 3 word message containing Rdip, Raddr,
and MC1. When the message arrives at the destination (Figure 7(b)), hardware enqueues
it in the priority 0 message queue. An H-Thread dedicated to message handling jumps to
the handler via Rdip, executes a store operation, and branches back to the dispatch
portion of the code.

Message Injection: A message is composed in a cluster's general registers and
transmitted atomically with a single SEND instruction that takes as arguments a
destination virtual address, a dispatch instruction pointer (DIP), and the message
body length. Hardware composes the message by prepending the destination and DIP to
the message body and injects it into the network. Two message priorities are provided:
user messages are sent at priority 0, while priority 1 is used for system level
message reply, thus avoiding deadlock.

Message Address Translation: As described in [19], the explicit management of
processor identifiers by application programs is cumbersome and slow. To eliminate
this overhead, the MAP implements a Global Translation Lookaside Buffer (GTLB), backed
by a software Global Destination Table (GDT), to hold mappings of virtual address
regions to node numbers. These mappings may be changed by system software. The user
specifies the destination of a message with a virtual address, which the network
output interface hardware uses to access the GTLB and calculate the physical
destination node.

With  a  single  GTLB  entry,  a  range  of  virtual  ad¬ 
dresses  (called  a  page-group)  is  mapped  across  a  region 
of  processors.  In  order  to  simplify  encoding,  the  page- 
group  must  be  a  power  of  2  pages  in  size,  where  each 
page  is  1024  words.  The  mapped  processors  must  be 
in  a  contiguous  3-D  rectangular  region  with  a  power 
of  2  number  of  nodes  on  a  side.  This  information  is  en¬ 
coded  in  a  single  GTLB  entry  as  shown  in  Figure  8.  The 
virtual  page  field  is  used  as  the  tag  during  the  fully  as¬ 
sociative  GTLB  lookup.  The  starting  node  specifies  the 
coordinates  of  the  origin  of  the  region  of  mapped  pro¬ 
cessors,  while  the  extent  specifies  the  base  2  logarithm 
of  the  X,  Y,  and  Z  dimensions  of  the  region.  The  page- 
group  length  field  specifies  the  number  of  local  pages 
that  are  mapped  into  the  page  group.  The  pages-per- 
node  field  indicates  the  number  of  pages  placed  on  each 
consecutive  processor,  and  is  used  to  implement  a  spec¬ 
trum  of  block  and  cyclic  interleavings. 
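
As a rough illustration of how one GTLB entry could spread a page-group over a region of nodes, the following Python sketch models the fields of Figure 8. The exact hardware mapping is not spelled out here, so the ordering of nodes within the region (X varying fastest), the cyclic wrap-around once every node has received its pages, and all names (`GtlbEntry`, `translate`, `first_page`) are assumptions made for illustration only.

```python
from dataclasses import dataclass

PAGE_WORDS = 1024  # page size in words, as stated in the text


@dataclass(frozen=True)
class GtlbEntry:
    """Illustrative model of one GTLB entry (field names are hypothetical)."""
    first_page: int      # virtual page number at which the page-group starts
    start_node: tuple    # (x, y, z) origin of the mapped region
    log2_extent: tuple   # base 2 logarithm of the X, Y, Z dimensions
    group_pages: int     # page-group length in pages (a power of 2)
    pages_per_node: int  # pages placed on each consecutive node


def translate(entry, word_addr):
    """Return the (x, y, z) node assumed to hold the given virtual word address."""
    page = word_addr // PAGE_WORDS - entry.first_page
    if not 0 <= page < entry.group_pages:
        raise LookupError("address is outside this page-group (GTLB miss)")
    ex, ey, ez = entry.log2_extent
    nodes = 1 << (ex + ey + ez)
    # Assumed interleaving: pages_per_node consecutive pages per node,
    # wrapping around the region cyclically.
    index = (page // entry.pages_per_node) % nodes
    dx = index & ((1 << ex) - 1)
    dy = (index >> ex) & ((1 << ey) - 1)
    dz = index >> (ex + ey)
    x0, y0, z0 = entry.start_node
    return (x0 + dx, y0 + dy, z0 + dz)
```

Under these assumptions, `pages_per_node = 1` yields a purely cyclic distribution of pages, while setting it to the page-group length divided by the number of nodes yields a block distribution; values in between give the intermediate interleavings mentioned above.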

Message  Reception:  At  the  destination  node,  an  ar¬ 
riving  message  is  automatically  placed  in  a  hardware 
message  queue.  The  head  of  the  message  queue  is 
mapped  to  a  register  accessible  by  an  H-Thread  (in 
either  cluster  2  or  3,  depending  on  message  priority) 
in  the  event  V-Thread.  The  message  dispatch  handler 
code  running  in  that  H-Thread  stalls  until  the  mes¬ 
sage  arrives,  and  then  dequeues  the  dispatch  instruc¬ 
tion  pointer  (DIP)  and  jumps  to  it.  This  starts  execu¬ 
tion  of  the  specific  handler  code  to  perform  the  action 
requested  in  the  message.  Some  of  the  actions  include 
remote  read,  remote  write,  and  remote  procedure  call. 
The  message  need  not  be  copied  to  or  from  memory,  as 
it  is  accessible  via  a  general  register.  In  order  to  avoid 
overflow  of  the  fixed  size  message  queue  and  back  up 
of  the  network,  only  short,  well-bounded  tasks  are  exe¬ 
cuted  by  message  handlers.  Longer  tasks  are  enqueued 
to  be  run  as  a  user  process  on  a  user  V-Thread. 
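
The dispatch mechanism can be pictured with a small Python model. It is only a sketch: the real handler is a few MAP instructions reading a register-mapped queue (Figure 7(b)) and stalls while the queue is empty, whereas here the queue is a `deque`, the DIP is a key into a dictionary of handlers, and the loop simply returns when no messages remain. The names `dispatch`, `remote_write`, `HANDLERS`, and `memory` are hypothetical.

```python
from collections import deque

memory = {}  # stand-in for the node's local memory, keyed by virtual address


def remote_write(body):
    """Handler for a remote store: the body holds (address, value)."""
    address, value = body
    memory[address] = value


# Hypothetical table of handler entry points, indexed by DIP.
HANDLERS = {"remote_write": remote_write}


def dispatch(queue):
    """Drain the queue, jumping to the handler named by each message's DIP."""
    handled = 0
    while queue:
        dip, *body = queue.popleft()  # dequeue the DIP first, as in the text
        HANDLERS[dip](body)           # short, bounded handler; then loop back
        handled += 1
    return handled
```

For example, `dispatch(deque([("remote_write", 0x100, 7)]))` performs one store and leaves `memory[0x100]` equal to 7.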

Protection:  The  M-Machine  communication  sub¬ 
strate  provides  fully  protected  user-level  access  to  the 
network.  The  SEND  instruction  atomically  launches  a 
message into the network, preventing a user from occupying
the network output indefinitely. The automatic
translation provided by the GTLB ensures that
a  program  may  only  send  messages  to  virtual  addresses 
within  its  own  address  space.  Finally,  restricting  the 
set  of  user  accessible  DIPs  prevents  a  user  handler  from 
monopolizing  the  network  input.  If  an  illegal  DIP  is 
used,  a  fault  will  occur  on  the  sending  thread  before  the 
message  is  sent. 

(a)  Message Send

       LOAD    A[0], MC1        ; load A[0] into message composition register 1
       SEND    Raddr, Rdip, #1  ; send a remote store message to the processor
                                ; containing VA Raddr, with a 1 word body

(b)  Message Receive

loop:  JMP     Rnet             ; jump to DIP (remote write) when message arrives
       ; start of remote write code
       MOVE    Rnet, R1         ; move virtual address into R1
       STORE   Rnet, R1         ; store the body word of the message into memory
       BRANCH  loop             ; branch back to message dispatch code

Figure 7: Example of M-Machine code implementing a remote store: (a) Sending a 3 word remote store message.
(b) Receiving and performing the store.


Field                 Width
Virtual Page          42 bits
Starting Node         16 bits
Page-group Length     6 bits
Pages/Node            6 bits
Extent (Z, Y, X)      3 bits each

Figure 8: Format of a Global Destination Table (and GTLB) entry, used to determine a physical node identifier
from a virtual address.


Throttling: In order to prevent a processor from injecting
messages at a rate higher than they can be consumed,
the M-Machine implements a return-to-sender
throttling protocol. A portion of a local node’s memory
is used for returned message buffering. When a message
is sent, a counter is automatically decremented,
which reserves buffer space for that message, should it
be returned. If the counter is zero, no buffer space
is available and no additional messages may be sent;
threads attempting to execute a SEND instruction will
stall. When the message reaches the destination, a reply
is sent indicating whether the destination was able
to handle the message. If the message was consumed,
the reply instructs the source processor to increment
its counter, deallocating the buffer space. Otherwise,
the reply contains the contents of the original message,
which are copied into the buffer and resent at a later
time.
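
The counter discipline of the return-to-sender protocol can be sketched in a few lines of Python. This is a behavioural model under stated assumptions, not the hardware: the class and method names are invented, the network is a plain list, and keeping a slot reserved while a returned message waits to be resent is an assumption, since the text does not say how the counter behaves on a resend.

```python
class SendThrottle:
    """Return-to-sender flow control for one node (names are illustrative)."""

    def __init__(self, buffer_slots):
        self.credits = buffer_slots  # counter of free return-buffer slots
        self.returned = []           # messages bounced back, awaiting resend

    def try_send(self, message, network):
        """Send if a return-buffer slot can be reserved; otherwise report a stall."""
        if self.credits == 0:
            return False             # the SEND instruction would stall here
        self.credits -= 1            # reserve space in case the message returns
        network.append(message)
        return True

    def on_reply(self, consumed, message=None):
        """Process the destination's reply to an earlier message."""
        if consumed:
            self.credits += 1        # free the reserved slot
        else:
            self.returned.append(message)  # held in its reserved slot for resend

    def resend_one(self, network):
        """Retry a returned message; its slot stays reserved (an assumption)."""
        if self.returned:
            network.append(self.returned.pop(0))
```

Because a slot is reserved before every send, a node can never have more messages outstanding than it is able to buffer if all of them are returned.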

Discussion:  The  M-Machine  provides  direct  register- 
to-register  communication,  avoiding  the  overhead  of 
memory  copying  at  both  the  sender  and  the  receiver, 
and  eliminating  the  dedicated  memory  for  message  ar¬ 
rival,  as  is  found  on  the  J-Machine  [6].  Register-mapped 
network  interfaces  have  been  used  previously  in  the  J- 
Machine  and  iWarp  [2],  and  have  been  described  by 
*T  [20]  as  well  as  Henry  and  Joerg  [11].  However,  none 
of  these  systems  provide  protection  for  user-level  mes¬ 
sages. 

Systems, like the J-Machine, that provide user access
to the network interface without atomicity must
temporarily disable interrupts to allow the sending process
to complete the message. The M-Machine’s atomic
SEND instruction eliminates this requirement at the cost
of limiting message length to the number of cluster registers.
Most messages fit easily in this size, and larger
messages can be packetized and reassembled with very
low overhead.

Automatic  translation  of  virtual  processor  numbers 
to  physical  processor  identifiers  is  used  in  the  Cray 
T3D  [5].  The  use  of  virtual  addresses  as  message  desti¬ 
nations  in  the  M-Machine  has  two  advantages.  When 
combined  with  translation  hardware,  it  provides  protec¬ 
tion  for  user  initiated  messages,  without  incurring  the 
overhead  of  operating  system  invocation,  as  messages 
may  not  be  sent  to  processors  mapped  outside  of  the 
user’s  virtual  address  space.  It  also  facilitates  the  im¬ 
plementation  of  global  shared  memory.  The  interleav¬ 
ing  performed  by  the  GTLB,  although  not  as  versatile 
as the Cray T3D address centrifuge or the interleaving
of  the  RP3  [21],  provides  a  means  of  distributing  ranges 
of  the  address  space  across  a  region  of  nodes. 

In  contrast  to  both  *T  and  FLASH  [14]  which  use  a 
separate  communication  coprocessor  for  receiving  mes¬ 
sages,  the  M-Machine  incorporates  that  function  on  its 
already  existing  execution  resources,  an  H-Thread  in 
the event V-Thread. This avoids idling resources associated
with a dedicated processor. During periods of
few  messages,  user  threads  may  make  full  use  of  the 
cluster’s  arithmetic  and  memory  bandwidth. 

4.2  Non-Cached  Shared  Memory 

Fast  access  to  remote  memory  is  provided  transparently 
to  the  user  with  a  combination  of  hardware  and  software 
mechanisms.  When  a  load  or  store  operation  causes  a 
Local  Translation  Lookaside  Buffer  (LTLB)  miss,  a  soft¬ 
ware  trap  is  signalled.  Like  the  hardware  dedicated  to 
message  arrival,  one  H-Thread  in  the  event  V-Thread 
is  reserved  for  handling  LTLB  misses.  The  LTLB  miss 
handler  code  probes  the  GTLB  to  determine  where  the 
requested  data  is  located,  and  if  necessary,  sends  a  mes¬ 
sage  to  the  destination  node.  If  the  data  is  in  fact  local, 
the  LTLB  miss  handler  fetches  the  required  page  table 
entry  and  places  it  in  the  LTLB.  Using  a  small  portion 
of  the  execution  resources  for  fast  trap  handling  reduces 
the  latency  of  both  local  LTLB  misses  and  remote  data 
access. 

The  sequence  of  operations  required  to  satisfy  a  re¬ 
mote  memory  load  is  shown  below.  The  labels  HW  and 
SW  indicate  whether  the  action  is  performed  by  hard¬ 
ware  or  software. 

1.  HW:  Memory  operation  accesses  the  cache  and 
misses  (2  cycles). 

2.  HW:  LTLB  miss  occurs,  enqueueing  an  event  (2 
cycles). 

3.  SW:  Software  accesses  the  local  page  table  (LPT), 
probes  the  GTLB,  and  composes  and  sends  a 
message  containing  the  referenced  and  return  ad¬ 
dresses  (48  cycles). 

4.  HW:  Message  delivered  to  remote  node  (5  cycles). 

5.  SW:  Message  handler  fetches  requested  data  from 
memory,  formats  a  reply  message,  and  sends  it  (29 
cycles). 

6.  HW:  Return  message  delivered  (5  cycles). 

7.  SW:  Message  handler  decodes  the  original  load 
destination  register  and  writes  the  data  directly 
there  (41  cycles). 
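
As a quick tally of the steps above, the following Python snippet splits the itemized cycle counts between hardware and software. The step descriptions are abbreviated paraphrases of the list, and the variable names are illustrative.

```python
# (step description, performed by, cycles) as listed in the text
REMOTE_LOAD_STEPS = [
    ("cache access and miss", "HW", 2),
    ("LTLB miss, event enqueued", "HW", 2),
    ("LPT access, GTLB probe, request sent", "SW", 48),
    ("request delivered", "HW", 5),
    ("remote handler fetches data and replies", "SW", 29),
    ("reply delivered", "HW", 5),
    ("reply decoded, register written", "SW", 41),
]


def cycles_by(kind):
    """Sum the cycles of all steps performed by 'HW' or 'SW'."""
    return sum(c for _, who, c in REMOTE_LOAD_STEPS if who == kind)


total = sum(c for _, _, c in REMOTE_LOAD_STEPS)
print(f"hardware: {cycles_by('HW')} cycles, "
      f"software: {cycles_by('SW')} cycles, total: {total}")
# prints "hardware: 14 cycles, software: 118 cycles, total: 132"
```

The itemized steps thus account for 132 cycles, of which software contributes 118. Table 1 reports 138 cycles for a complete remote read that hits in the cache; the list does not itemize the remaining cycles.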

Timelines  for  both  remote  read  and  write  accesses 
are  shown  in  Figure  9.  These  measurements  are  esti¬ 
mates  based  on  prototype  message  and  event  handlers 
running  on  the  M-Machine  simulator.  A  user  level  pro¬ 
gram  running  on  node  0  makes  read  and  write  requests 
to  memory  on  neighboring  node  1.  Except  for  the  mes¬ 
sage  handler  that  runs  on  demand,  node  1  is  idle.  All 
references  to  memory  system  data  structures  in  the  soft¬ 
ware  handlers  are  assumed  to  cache  hit. 

Table 1 shows a comparison of preliminary results of
local and remote access latencies (in cycles). A read
is completed when the requested data has been written
into the destination register. A write is completed
when the line containing the data has been fully loaded
into the cache. The remote read and write latencies are
larger than their local counterparts due to the software
intervention required to send the message to the remote
node. However, the time to perform a remote read that
hits in the cache is only about twice as large as a local read
that requires software intervention (LTLB miss). For
the remote write, which does not require return data,
the difference is only 10%.

                     Access Times (cycles)
Access Type            READ    WRITE
Local Cache Hit           3        2
Local Cache Miss         13       19
Local LTLB Miss          61       67
Remote Cache Hit        138       74
Remote Cache Miss       154       90
Remote LTLB Miss        202      138

Table 1: Comparison of local and remote access times,
assuming no resource contention.
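
The two comparisons drawn from Table 1 can be checked with a few lines of Python. The dictionary simply restates the table; the function names are illustrative.

```python
# Read and write latencies in cycles, copied from Table 1.
ACCESS_CYCLES = {
    "local cache hit": (3, 2),
    "local cache miss": (13, 19),
    "local LTLB miss": (61, 67),
    "remote cache hit": (138, 74),
    "remote cache miss": (154, 90),
    "remote LTLB miss": (202, 138),
}


def read_ratio(slow, fast):
    """Ratio of the read latencies of two access types."""
    return ACCESS_CYCLES[slow][0] / ACCESS_CYCLES[fast][0]


def write_overhead(slow, fast):
    """Fractional increase in write latency of one access type over another."""
    w_slow, w_fast = ACCESS_CYCLES[slow][1], ACCESS_CYCLES[fast][1]
    return (w_slow - w_fast) / w_fast


print(f"{read_ratio('remote cache hit', 'local LTLB miss'):.2f}")      # 2.26
print(f"{write_overhead('remote cache hit', 'local LTLB miss'):.1%}")  # 10.4%
```

A remote read that hits in the cache therefore costs about 2.3 times a local LTLB miss, and a remote write costs about 10% more than a local write that misses in the LTLB.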

4.3  Caching  and  Coherence 

Even  though  remote  accesses  are  fast,  their  latency  is 
still  large  compared  to  local  memory  references.  This 
overhead reduces the ability of the MAP to use the network
and remote memory bandwidth effectively. To
reduce  overall  latency  and  improve  bandwidth  utiliza¬ 
tion,  each  M-Machine  node  may  use  its  local  memory 
to  cache  data  from  remote  nodes. 

In  addition  to  the  virtual  to  physical  mapping,  each 
LTLB  (and  LPT)  entry  contains  2  status  bits  for  each 
cache  block  in  the  page.  These  block  status  bits  are  used 
to  provide  fine  grained  control  over  8  word  blocks,  al¬ 
lowing  different  blocks  within  the  same  mapped  page 
to  be  in  different  states.  This  fine  grained  control  over 
data  is  similar  to  that  provided  in  hardware  based  cache 
coherent  multiprocessors,  and  alleviates  the  false  shar¬ 
ing  that  exists  in  other  software  data  coherence  sys¬ 
tems  [16].  The  two  block  status  bits  are  used  to  encode 
the  following  four  states: 

•  INVALID:  The  block  may  not  be  read,  written,  or 
placed  in  the  hardware  cache. 

•  READ-ONLY:  The  block  may  be  read,  but  not  writ¬ 
ten. 

•  READ/WRITE:  The  block  may  be  read  or  written. 

•  DIRTY:  The  block  may  be  read  or  written,  and  it 
has  been  written  since  being  copied  to  the  local 
node. 

Figure 9: Timeline for remote read and write accesses.

One software policy that uses the block status bits
fetches remote cache blocks on demand. When a memory
reference occurs, the block status bits corresponding
to the global virtual address are checked in hardware. If
the  attempted  operation  is  not  allowed  by  the  state  of 
the  block,  a  software  trap  called  a  block  status  fault  oc¬ 
curs.  The  trap  code  runs  in  the  event  V-Thread,  in  the 
H-Thread  that  is  reserved  for  handling  block  status  and 
synchronization  events.  The  block  status  handler  sends 
a  message  to  the  home  node,  which  can  be  determined 
using  the  GTLB,  requesting  the  cache  block  containing 
the  data.  The  home  node  logs  the  requesting  node  in  a 
software  managed  directory  and  sends  the  block  back. 
When  the  block  is  received,  the  data  is  written  to  mem¬ 
ory  and  the  block  status  bits  are  marked  valid.  If  the 
virtual  page  containing  the  block  is  not  mapped  to  a 
local  physical  page,  a  new  page  table  entry  is  created 
and  only  the  newly  arrived  block  is  marked  valid.  The 
remote  data  may  be  loaded  into  the  on-chip  cache,  and 
modifications  to  the  data  will  automatically  mark  the 
block  state  dirty.  More  complex  coherence  schemes  can 
map  blocks  from  different  virtual  pages  into  the  same 
physical  page,  reducing  the  amount  of  unmapped  phys¬ 
ical  memory. 
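
The permission check and the on-demand policy described above can be modelled roughly as follows. This Python sketch makes several assumptions that the text does not confirm: the numeric encoding of the four states, the choice of READ/WRITE as the state a block takes when it is "marked valid", and every function and variable name. With 1024-word pages and 8-word blocks, a page holds 128 blocks, so a page's status is modelled as a list of 128 states.

```python
from enum import Enum


class BlockState(Enum):
    # The two-bit encoding is an assumption; the text names only the states.
    INVALID = 0
    READ_ONLY = 1
    READ_WRITE = 2
    DIRTY = 3


BLOCK_WORDS = 8  # block size given in the text


def check_access(state, is_write):
    """Return the state after a permitted access, or raise on a block status fault."""
    if state is BlockState.INVALID:
        raise PermissionError("block status fault: block is invalid")
    if is_write:
        if state is BlockState.READ_ONLY:
            raise PermissionError("block status fault: write to read-only block")
        return BlockState.DIRTY  # a write marks the block dirty
    return state


def fetch_on_demand(page_states, word_offset, directory, requester):
    """Sketch of the fault handler: log the requester, then mark the block valid."""
    block = word_offset // BLOCK_WORDS
    # The home node records the requesting node in its software directory.
    directory.setdefault(block, set()).add(requester)
    page_states[block] = BlockState.READ_WRITE  # assumed state after the fetch
    return block
```

In this model a first access to an invalid block raises the fault, the handler fetches and validates only that block, and a later write moves it to DIRTY without involving the home node again.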

The  software  handlers  used  to  transmit  data  from 
node  to  node  may  implement  a  variety  of  coherence 
policies  and  protocols.  This  code  is  easily  incorporated 
within  the  remote  read  and  write  handlers  described  in 
Section  4.2.  Using  local  memory  as  a  repository  will 
allow  remote  data  to  be  cached  locally  beyond  the  ca¬ 
pacity  of  the  local  on-chip  cache  alone. 

Discussion: Directory-based, cache coherent multiprocessors
such as Alewife [1] and DASH [15] implement
coherence policies in hardware. This improves performance
at the cost of flexibility. Like the M-Machine,
FLASH  [14]  implements  remote  memory  access  and 
cache  coherence  in  software,  but  uses  a  coprocessor. 
However,  this  system  does  not  provide  block  status  bits 
in  the  TLB  to  support  caching  remote  data  in  DRAM. 
The  subpage  status  bits  of  the  KSR-1  [7]  perform  a 
function  similar  to  that  of  the  block  status  bits  of  the 
M-Machine. 

Implementing remote memory access and coherence
completely in software on a conventional processor
would  involve  delays  much  greater  than  those  shown  in 
Table  1  as  evidenced  by  experience  with  the  Ivy  system 
[16].  The  M-Machine’s  fast  exception  handling  in  a  ded¬ 
icated  H-Thread  avoids  the  delay  associated  with  con¬ 
text  switching  and  allows  the  user  thread  to  execute  in 
parallel  with  the  exception  handler.  The  GTLB  avoids 
the  overhead  of  manual  translation  and  the  cost  of  a  sys¬ 
tem  call  to  access  the  network.  Finally,  the  M-Machine 
provides memory-mapped addressing of thread registers
to  allow  the  operation  to  be  completed  in  software. 

The major contributors to remote access latency in
the M-Machine are the search for the faulting address
in the local page table and decoding the reply message
(about 40 cycles each). The page-table overhead is only
incurred when accessing the first block of a page. Accesses
to subsequent blocks cause block-status faults (rather
than page faults), which skip the page-table access. The
reply decode could be accelerated by prohibiting the
faulting V-Thread from swapping out during the memory
operation. However, this would complicate scheduling
and remote handling of potentially long latency
synchronizing memory operations.

5  Conclusion 

In  this  paper  we  have  described  the  architecture  of  the 
M-Machine  with  an  emphasis  on  its  novel  features.  The 
M-Machine  is  a  3-D  mesh,  each  node  of  which  contains 
a multi-ALU processor (MAP) and 8 MBytes of synchronous
DRAM. Each MAP chip consists of four 64-bit
3-issue clusters connected by a cluster switch, a 4-way
interleaved  on-chip  cache,  an  external  memory  interface, 
and  on-chip  network  interfaces  and  routers. 

Instruction  level  parallelism  is  exploited  both  within 
a  cluster  and  across  clusters  using  H-Threads.  An 
H-Thread  may  communicate  and  synchronize  through 
registers  with  H-Threads  on  different  clusters  but 
within  the  same  V-Thread.  A  27  point  stencil  com¬ 
putation  on  4  H-Threads  (12-wide  issue)  has  a  static 
instruction  length  half  that  of  1  H-Thread  (3-wide  is¬ 
sue). 

To  increase  use  of  the  local  memory  and  execution 
bandwidth,  multiple  tasks,  called  V-Threads,  are  inter¬ 
leaved  on  a  cycle-by-cycle  basis  independently  on  each 
of  the  clusters.  Each  cycle,  a  different  thread  may  be 
selected  for  execution,  or  if  only  one  V-Thread  is  res¬ 
ident,  it  may  issue  an  instruction  every  cycle  on  each 
cluster. 

The M-Machine has a user-level, protected, fast message
passing substrate to reduce communication and remote
memory latencies. Messages are composed in general
registers and sent via a user level SEND instruction.
Arriving  messages  are  extracted  by  a  system-level  soft¬ 
ware  message  dispatch  handler,  which  is  always  resident 
in  the  event  V-Thread.  The  message  contents  are  ac¬ 
cessed  via  a  register  mapped  queue.  The  message  need 
not  be  copied  to  or  from  memory  on  either  the  sending 
or  receiving  side.  Two  level  translation  is  used  to  inde¬ 
pendently  relocate  objects  in  the  physical  address  space 
on  a  node,  and  in  the  processor  namespace. 

The  fast  message  system  is  used  to  provide  the  user 
with  transparent  access  to  remote  memory.  When  a 
user’s load or store instruction traps to software on an
LTLB  miss,  a  message  is  sent  to  a  remote  node  to  per¬ 
form  the  access.  While  slower  than  local  accesses,  a  re¬ 
mote  load  can  be  satisfied  in  138  cycles,  while  a  remote 
store  can  be  satisfied  in  74  cycles.  In  order  to  facili¬ 
tate  local  caching  of  remote  data,  2  status  bits  for  each 
block  (8  words)  in  a  page  are  added  to  the  LTLB  and 
page  table  entries.  When  an  invalid  block  is  accessed,  a 
trap  to  software  occurs  which  can  retrieve  the  missing 
block  from  a  remote  node,  copy  it  into  local  memory, 
and  mark  the  status  bits  valid. 

A cycle-accurate simulator of the M-Machine has
been completed and is being used for software development.
M-Machine software is being designed and implemented
jointly with the Scalable Concurrent Programming
group at Caltech. The Multiflow compiler [17] is
being ported to the M-Machine to generate long instructions
spanning multiple clusters. It is currently able to
generate code for a single cluster. A prototype runtime
system consisting of primitive message and event handlers
has also been implemented. The hardware design
of the MAP is currently underway; 80% of the modules
have been designed at the RTL level and some layout
has begun. The MAP will be implemented on a single
integrated circuit with a projected area of 17 mm x 18 mm
in 0.5 µm CMOS with 5 metal layers. Tapeout is
expected in 1996.

The M-Machine addresses the issues of non-uniform
technology scaling and of programmability. By changing
the ratio of processor to memory area, the
M-Machine better balances cost and improves the
utilization of the increasingly critical memory bandwidth.
The M-Machine increases the ratio of processor to memory
silicon area to 11% from 0.13% for a typical 1996 system.
A 32-node (128 clusters) M-Machine with a total
of 256 MBytes of memory requires 50% more area than a
uniprocessor with the same amount of memory but provides
128 times as much peak performance, an 85:1
improvement in peak-performance/area. This increase in
processing resources allows the M-Machine to saturate
the costly DRAM bandwidth even for problems with
good locality. Programs therefore run faster, and a
fixed-size memory system can run more programs per unit
time. The 85:1 improvement in peak-performance/area
makes the increased parallelism of the M-Machine cost
effective even in cases where only a small fraction of its
peak performance is realized.
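
The 85:1 figure follows directly from the numbers quoted above, as this short Python check shows. The function name and parameters are illustrative, and the calculation takes the stated 128-fold peak performance (one unit per cluster) at face value.

```python
def perf_per_area_gain(nodes, clusters_per_node, area_ratio):
    """Peak-performance/area improvement over a uniprocessor.

    Assumes, as the text states, that peak performance scales with the
    number of clusters and that area_ratio is total area relative to
    the uniprocessor system with the same memory.
    """
    relative_performance = nodes * clusters_per_node
    return relative_performance / area_ratio


gain = perf_per_area_gain(nodes=32, clusters_per_node=4, area_ratio=1.5)
print(f"{gain:.1f}:1")  # 85.3:1

# 256 MBytes spread over 32 nodes matches the 8 MBytes per node quoted earlier.
mbytes_per_node = 256 / 32
```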

The  M-Machine  addresses  the  problem  of  paral¬ 
lel  software  by  supporting  an  incremental  approach  to 
parallelization.  Unlike  conventional  parallel  machines, 
the  M-Machine  can  efficiently  run  a  sequential  pro¬ 
gram  that  uses  all  the  machine’s  memory,  including 
that  on  remote  nodes.  A  shared  address  space,  high- 
performance  messaging,  and  caching  remote  data  in  lo¬ 
cal  DRAM  provide  fast  access  to  remote  data.  The  se¬ 
quential  program  can  then  be  divided  into  tasks,  such 
as  loop  iterations  or  subroutines,  that  can  be  executed 
in  parallel.  The  ability  to  support  fine-grain  paral¬ 
lelism  increases  the  number  of  suitable  tasks  and  al¬ 
lows  extraction  of  more  parallelism  from  small  prob¬ 
lems.  Support  for  synchronizing  memory  operations  and 
global  addressing  simplifies  user-level  communication 
and  synchronization  between  tasks  and  reduces  over¬ 
head.  Caching  in  DRAM  automates  much  of  the  data 
placement  and  migration  problem.  For  the  cases  where 
a  programmer  wants  to  extract  the  maximum  perfor¬ 
mance,  fast,  protected,  user-level  messages  may  be  em¬ 
ployed. 

We expect that the architecture concepts demonstrated
in the M-Machine will be useful in machines
ranging from single-node personal computers, through
workstations with tens of nodes, to servers with hundreds
to thousands of nodes. Memory bandwidth and
capacity are becoming the dominant factor in the cost
and performance of systems of all scales. By changing
the processor/memory ratio, providing methods for
extracting parallelism at all levels, and supporting an
incremental approach to parallelism, the M-Machine’s
mechanisms will lead to more cost effective and programmable
machines across the price-performance spectrum.